Faster unique, isdistinct, merge_sorted, and sliding_window. #178

Merged: 1 commit merged into pytoolz:master on May 18, 2014

Conversation

@eriknw commented May 10, 2014

The `key` keyword argument to `unique` was changed from `identity` to `None`.
This better matches the API elsewhere, and lets us stop redefining `identity`
in `itertoolz`, which always seemed a little weird.

Most of the speed improvements come from avoiding attribute resolution in
frequently run code.  Attribute resolution (i.e., the "dot" operator) is
probably more costly than one would expect.  Fortunately, there weren't
many places to apply this optimization, so impact on code readability was
minimal.

`unique` employs another optimization: branching by `key is None` outside the
loop (thus requiring two loops).  While this violates the DRY principle (and,
hence, I would prefer not to do it in general), this is only a few lines of
code that remain side-by-side, and the performance increase is worth it.

`merge_sorted` is now optimized for when only a single iterable remains.  This
makes that case *so* much faster.
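
As a rough illustration of the kind of change involved (an editorial sketch, not code from this PR), hoisting a bound method out of a hot loop trades a per-iteration attribute lookup for a single lookup up front:

```python
def collect_evens_dotted(seq):
    evens = []
    for item in seq:
        if item % 2 == 0:
            evens.append(item)  # resolves the `append` attribute on every iteration
    return evens


def collect_evens_hoisted(seq):
    evens = []
    evens_append = evens.append  # resolve the bound method once, up front
    for item in seq:
        if item % 2 == 0:
            evens_append(item)   # plain local-variable call inside the loop
    return evens
```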
eriknw added a commit to eriknw/toolz that referenced this pull request May 10, 2014
Issue pytoolz#178 impressed upon me just how costly attribute resolution can
be.  In this case, `groupby` was made faster by avoiding resolving the
attribute `list.append`.

This implementation is also more memory efficient than the current
version that uses a `defaultdict` that gets cast to a `dict`.  While
casting a defaultdict `d` to a dict as `dict(d)` is fast, it is still
a fast *copy*.

Honorable mention goes to the following implementation:
```python
import collections

# NOTE: `iteritems` is assumed to come from toolz.compatibility
# (dict.iteritems on Python 2, dict.items on Python 3).
from toolz.compatibility import iteritems


def groupby_alt(func, seq):
    d = collections.defaultdict(lambda: [].append)
    for item in seq:
        d[func(item)](item)
    rv = {}
    for k, v in iteritems(d):
        rv[k] = v.__self__
    return rv
```
This alternative implementation can at times be *very* impressive.  You
should play with it!
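
For context, a quick usage sketch (editorial, not part of the commit message), assuming the `groupby_alt` above has been defined:

```python
# Group the integers 0-9 by parity using the groupby_alt defined above.
result = groupby_alt(lambda x: x % 2, range(10))
print(result)  # {0: [0, 2, 4, 6, 8], 1: [1, 3, 5, 7, 9]} (key order may vary)
```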
@eriknw mentioned this pull request on May 10, 2014
```diff
@@ -120,7 +117,9 @@ def _merge_sorted_key(seqs, key):
     heapq.heapify(pq)
 
     # Repeatedly yield and then repopulate from the same iterator
-    while True:
+    heapreplace = heapq.heapreplace
+    heappop = heapq.heappop
```
Oh man, I never would have thought of this.
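
As a rough, editorial illustration of why binding `heapq` functions to local names helps (not code from this PR), the per-iteration module-attribute lookup can be measured directly:

```python
import heapq
import random
import timeit


def pop_all_dotted(heap):
    # `heapq.heappop` is re-resolved (module global + attribute) on every iteration.
    out = []
    while heap:
        out.append(heapq.heappop(heap))
    return out


def pop_all_local(heap):
    # Resolve the attribute once; the loop body only touches fast local names.
    heappop = heapq.heappop
    out = []
    while heap:
        out.append(heappop(heap))
    return out


data = sorted(random.random() for _ in range(10000))  # a sorted list is already a valid heap


def bench(func):
    # Give each run a fresh copy so both versions consume identical input.
    return timeit.timeit(lambda: func(list(data)), number=100)


print("dotted:", bench(pop_all_dotted))
print("local: ", bench(pop_all_local))
```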

@mrocklin

Do you have micro benchmarks to back up the value of these changes?

@eriknw commented May 10, 2014

> Do you have micro benchmarks to back up the value of these changes?

You bet. The following are variations of `unique`, and the comparison also shows why something like pytoolz/cytoolz#22 would be awesome to have:

```python
from toolz import unique  # original implementation, not from this PR
from cytoolz import unique as cyunique

def unique1(seq, key=None):
    seen = set()
    no_key = key is None
    for item in seq:
        val = item if no_key else key(item)
        if val not in seen:
            seen.add(val)
            yield item

def unique2(seq, key=None):
    seen = set()
    seen_add = seen.add
    for item in seq:
        val = item if key is None else key(item)
        if val not in seen:
            seen_add(val)
            yield item

def unique3(seq, key=None):
    seen = set()
    seen_add = seen.add
    no_key = key is None
    for item in seq:
        val = item if no_key else key(item)
        if val not in seen:
            seen_add(val)
            yield item

def unique4(seq, key=None):
    seen = set()
    seen_add = seen.add
    if key is None:
        for item in seq:
            if item not in seen:
                seen_add(item)
                yield item
    else:
        for item in seq:
            val = key(item)
            if val not in seen:
                seen_add(val)
                yield item
```

These are ordered from slowest to fastest. Now the benchmarks:

```
In [11]: L = range(1000)

In [12]: %timeit list(unique(L))
1000 loops, best of 3: 664 µs per loop

In [13]: %timeit list(unique1(L))
1000 loops, best of 3: 583 µs per loop

In [14]: %timeit list(unique2(L))
1000 loops, best of 3: 403 µs per loop

In [15]: %timeit list(unique3(L))
1000 loops, best of 3: 378 µs per loop

In [16]: %timeit list(unique4(L))
1000 loops, best of 3: 333 µs per loop

In [17]: %timeit list(cyunique(L))
10000 loops, best of 3: 131 µs per loop

In [18]: L = [1] * 1000

In [19]: %timeit list(unique(L))
1000 loops, best of 3: 308 µs per loop

In [20]: %timeit list(unique1(L))
10000 loops, best of 3: 136 µs per loop

In [21]: %timeit list(unique2(L))
10000 loops, best of 3: 198 µs per loop

In [22]: %timeit list(unique3(L))
10000 loops, best of 3: 136 µs per loop

In [23]: %timeit list(unique4(L))
10000 loops, best of 3: 95 µs per loop

In [24]: %timeit list(cyunique(L))
10000 loops, best of 3: 51.1 µs per loop
```

@mrocklin

Wow, that's very impressive.

@eriknw commented May 10, 2014

> Wow, that's very impressive.

Indeed, which is why I was compelled to try something as perverse as #179!

@eriknw commented May 11, 2014

On the topic of avoiding attribute resolution, another place to apply this optimization is importing. For example, `second` is often used in a tight loop, and it would be faster if we did `from itertools import islice` instead of `import itertools`. I don't know how aggressively we should apply this optimization technique. Should we use `from ... import ...` for everything, or just for things that are likely to be in inner loops and whose performance is noticeably improved?
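
An editorial sketch of the two import styles under discussion (the helper name `drop_first` is hypothetical, not a toolz function):

```python
# Module-style import: each call pays for a module-global lookup *plus* an
# attribute lookup (`itertools`, then `.islice`).
import itertools

def drop_first_v1(seq):
    return itertools.islice(seq, 1, None)

# Name-style import: `islice` is bound once at import time, so each call only
# performs a single module-global lookup.
from itertools import islice

def drop_first_v2(seq):
    return islice(seq, 1, None)
```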

@eriknw commented May 14, 2014

Want to see something awesome? Running `python runbench.py unique` gives me the following tables to copy/paste to github (see pytoolz/cytoolz#22):

**Benchmarks:** benchmarkz/bench_unique.py
**Functions:** toolz_arena/unique.py

**Time:**

|     **Bench** \ **Func** | **0** | **1** | **2** | **3** |  **4**   |
| ------------------------:|:-----:|:-----:|:-----:|:-----:|:--------:|
| **all_different** (`us`) |  590  |  466  |  400  |  351  | **306**  |
|      **all_same** (`us`) |  305  |  135  |  197  |  135  | **93.4** |
|          **tiny** (`us`) |  2.8  |  2.77 |  2.72 |  2.83 | **2.64** |

**Relative time:**

|**Bench** \ **Func** | **0** | **1** | **2** | **3** | **4** |
| -------------------:|:-----:|:-----:|:-----:|:-----:|:-----:|
|   **all_different** |  1.93 |  1.52 |  1.31 |  1.15 | **1** |
|        **all_same** |  3.27 |  1.45 |  2.11 |  1.45 | **1** |
|            **tiny** |  1.06 |  1.05 |  1.03 |  1.07 | **1** |

**Rank:**

|**Bench** \ **Func** | **0** | **1** | **2** | **3** | **4** |
| -------------------:|:-----:|:-----:|:-----:|:-----:|:-----:|
|   **all_different** |   5   |   4   |   3   |   2   | **1** |
|        **all_same** |   5   |   3   |   4   |   2   | **1** |
|            **tiny** |   4   |   3   |   2   |   5   | **1** |

Here is the full output (note that the first half is from "verbose=True" during benchmarking, and the second half is output controlled by the user):

```
Using benchmark file:
    benchmarkz/bench_unique.py

Using arena file:
    toolz_arena/unique.py

bench_all_different
     590 usec - unique0 - (2^9 = 512 loops)
     466 usec - unique1 - (2^10 = 1024 loops)
     400 usec - unique2 - (2^10 = 1024 loops)
     351 usec - unique3 - (2^10 = 1024 loops)
     306 usec - unique4 - (2^10 = 1024 loops)

bench_all_same
     305 usec - unique0 - (2^10 = 1024 loops)
     135 usec - unique1 - (2^12 = 4096 loops)
     197 usec - unique2 - (2^11 = 2048 loops)
     135 usec - unique3 - (2^12 = 4096 loops)
    93.4 usec - unique4 - (2^12 = 4096 loops)

bench_tiny
     2.8 usec - unique0 - (2^17 = 131072 loops)
    2.77 usec - unique1 - (2^17 = 131072 loops)
    2.72 usec - unique2 - (2^17 = 131072 loops)
    2.83 usec - unique3 - (2^17 = 131072 loops)
    2.64 usec - unique4 - (2^17 = 131072 loops)

**Benchmarks:** benchmarkz/bench_unique.py
**Functions:** toolz_arena/unique.py

**Time:**

|     **Bench** \ **Func** | **0** | **1** | **2** | **3** |  **4**   |
| ------------------------:|:-----:|:-----:|:-----:|:-----:|:--------:|
| **all_different** (`us`) |  590  |  466  |  400  |  351  | **306**  |
|      **all_same** (`us`) |  305  |  135  |  197  |  135  | **93.4** |
|          **tiny** (`us`) |  2.8  |  2.77 |  2.72 |  2.83 | **2.64** |

**Relative time:**

|**Bench** \ **Func** | **0** | **1** | **2** | **3** | **4** |
| -------------------:|:-----:|:-----:|:-----:|:-----:|:-----:|
|   **all_different** |  1.93 |  1.52 |  1.31 |  1.15 | **1** |
|        **all_same** |  3.27 |  1.45 |  2.11 |  1.45 | **1** |
|            **tiny** |  1.06 |  1.05 |  1.03 |  1.07 | **1** |

**Rank:**

|**Bench** \ **Func** | **0** | **1** | **2** | **3** | **4** |
| -------------------:|:-----:|:-----:|:-----:|:-----:|:-----:|
|   **all_different** |   5   |   4   |   3   |   2   | **1** |
|        **all_same** |   5   |   3   |   4   |   2   | **1** |
|            **tiny** |   4   |   3   |   2   |   5   | **1** |
```

The files "benchmarkz/bench_unique.py" and "toolz_arena/unique.py" really are as simple as one would hope.

"benchmarkz/bench_unique.py" :

```python
from toolz import unique

all_different = list(range(1000))
all_same = [1] * 1000
tiny = [1]


def bench_all_different():
    list(unique(all_different))


def bench_all_same():
    list(unique(all_same))


def bench_tiny():
    list(unique(tiny))
```

The first few lines of "toolz_arena/unique.py":

```python
def identity(x):
    return x


def unique0(seq, key=identity):
    seen = set()
    for item in seq:
        val = key(item)
        if val not in seen:
            seen.add(val)
            yield item
```

I'll push this code to github soon.

@mrocklin

This looks amazing. Is it a standalone project?

@eriknw commented May 14, 2014

> This looks amazing. Is it a standalone project?

It sure is!

Below is a basic "runbench.py" file. By convention, we look for "benchmark" and "arena" directories in the same directory as "runbench.py", but other paths may be used instead via keyword arguments. Searching for benchmarks, and for functions to run in those benchmarks, doesn't import (and, hence, run) any external Python code, and the user has a chance to review, remove, or add files and functions of their choosing after a `BenchFinder` gets created.

```python
from benchtoolz import BenchFinder, BenchRunner, BenchPrinter

if __name__ == '__main__':
    benchfinder = BenchFinder(name, cython=False)  # e.g., name = "unique"
    benchrunner = BenchRunner(benchfinder)
    results = benchrunner.run()
    benchprinter = BenchPrinter(results)

    # perhaps we should provide a less ugly way to do this...
    for (benchfile, arenafile), table in sorted(benchprinter.tables.items()):
        gfm_times = benchprinter.to_gfm(table)
        gfm_reltimes = benchprinter.to_gfm(table, relative=True)
        gfm_rank = benchprinter.to_gfm(table, rank=True)
        # print stuff
        ...
```

@mrocklin

> > This looks amazing. Is it a standalone project?
>
> It sure is!

can i haz it?

@mrocklin

Maybe even just seeing the code up on eriknw/benchtoolz before we "release" or whatnot.

@mrocklin

Is this ready to go in?

@eriknw commented May 18, 2014

> Is this ready to go in?

Yeah, I think it is.

mrocklin added a commit that referenced this pull request May 18, 2014
Faster unique, isdistinct, merge_sorted, and sliding_window.
@mrocklin merged commit 189dac6 into pytoolz:master on May 18, 2014